In this EDA (Exploratory Data Analysis) project I explore a dataset of financial contributions to the 2016 elections, examining its structure, variables, patterns and the relationships between those variables. I start with a few one-variable explorations, check their distributions and then move on to relationships between two or more variables.
The first variable is ‘amount’, the money a contributor donated to one or more of the candidates. It is the only column in the downloaded dataset that is a numeric vector rather than a character vector.
Important: the ‘finance’ dataset loaded here was ‘munged’ in a separate file called ‘all-munge.R’. The original dataset had 19 columns; I removed the ones I found less interesting for the scope of this project.
In that file (‘all-munge.R’) I also:
Changed a few of the variable names;
Shortened the candidates’ names to their last name only;
Restricted the data to the primaries and general elections of 2016;
Removed all donated amounts that were negative (had a minus sign);
Added a column for the candidate’s party affiliation;
Added a column with the contributor’s gender, based on a pre-defined database that I downloaded to my computer;
Added a new column with the day and the year, extracted from the contributions’ date column.
All in all, this produced a better-organized dataset to work with for the EDA.
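The munging steps listed above could look roughly like the sketch below. ‘all-munge.R’ itself is not shown in this report, so everything here is an assumption made for illustration: the raw column names (`cand_nm`, `contb_receipt_amt`), the toy rows and the `party_lookup` table are hypothetical.

```r
# Toy stand-in for the raw file; the raw column names are assumed, not taken from all-munge.R
raw <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Trump, Donald J.",
              "Sanders, Bernard", "Trump, Donald J."),
  contb_receipt_amt = c(25, -50, 27, 100),
  election_tp = c("P2016", "G2016", "P2016", "G2012"),
  stringsAsFactors = FALSE
)

# Rename a column and keep only the candidate's last name
names(raw)[names(raw) == "contb_receipt_amt"] <- "amount"
raw$candidate <- sub(",.*$", "", raw$cand_nm)

# Keep only the 2016 primaries/general election and drop negative amounts
finance_toy <- raw[raw$election_tp %in% c("P2016", "G2016") & raw$amount >= 0, ]

# Hypothetical lookup table for the party affiliation column
party_lookup <- c(Clinton = "Democrat", Sanders = "Democrat", Trump = "Republican")
finance_toy$party <- unname(party_lookup[finance_toy$candidate])

finance_toy[, c("candidate", "amount", "election_tp", "party")]
```

Doing each step on the full data frame in one pass, and saving the result, is what lets the analysis file start from a clean dataset.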
First, let’s take a look at the dataset, get familiar with the variables and ask questions about the data.
## [1] 7353272 14
We can see above that the dataset now has 7,353,272 observations (individual contributions) and 14 columns. Thirteen of them are described below; the fourteenth, “tran_id”, is the transaction ID that appears in the summary that follows. The columns are:
“cand_id” - candidate ID
“candidate” - Candidate name
“contributor” - Contributor name
“city” - Contributor city
“state” - Contributor state
“zipcode” - Contributor zipcode
“employer” - Contributor employer
“occupation” - Contributor occupation
“amount” - Amount contributed
“date” - Contribution transaction date
“election_tp” - Election type (General or Primaries)
“party” - The political party of the candidate
“gender” - The contributor’s gender
## cand_id candidate contributor
## P00003392:3506373 Clinton:3506373 TRUITT, ROBERTA : 1520
## P60007168:2042624 Sanders:2042624 BODNICK, KATIE : 1313
## P80001571: 746616 Trump : 746616 AMISIAL, WILFRID: 1078
## P60006111: 543405 Cruz : 543405 PURCELL, LARRY : 722
## P60005915: 246313 Carson : 246313 SMITH, DAVID : 686
## P60006723: 98957 Rubio : 98957 WILLIAMS, JAMES : 682
## (Other) : 168984 (Other): 168984 (Other) :7347271
## city state zipcode
## NEW YORK : 204204 california :1294446 Min. : 0
## LOS ANGELES : 102524 new york : 640831 1st Qu.:20837
## SAN FRANCISCO: 90577 texas : 539992 Median :53095
## WASHINGTON : 90229 florida : 420024 Mean :52971
## BROOKLYN : 87279 washington : 293342 3rd Qu.:89141
## SEATTLE : 83549 massachusetts: 279133 Max. :99999
## (Other) :6694910 (Other) :3885504 NA's :390
## employer occupation
## N/A : 995348 RETIRED :1642509
## RETIRED : 908452 NOT EMPLOYED : 626063
## SELF-EMPLOYED : 535009 INFORMATION REQUESTED: 239700
## NONE : 452810 ATTORNEY : 199767
## NOT EMPLOYED : 265417 TEACHER : 141592
## INFORMATION REQUESTED: 239965 PHYSICIAN : 111942
## (Other) :3956271 (Other) :4391699
## amount date tran_id
## Min. : 0 Min. :2013-10-01 A4EA7F7D9338943869B5: 8
## 1st Qu.: 15 1st Qu.:2016-03-02 AA2F3125A0DB141928EB: 8
## Median : 28 Median :2016-05-27 AAC874DDA3EA04584A39: 8
## Mean : 127 Mean :2016-05-19 AB37264C070244DDDBF7: 8
## 3rd Qu.: 92 3rd Qu.:2016-09-04 SA17A.4143 : 7
## Max. :4904861 Max. :2016-12-31 A1F4C793991D1416D939: 6
## (Other) :7353227
## election_tp party gender
## G2016:2607976 Democrat :5556219 female:3703574
## P2016:4745296 Green : 9033 male :3649698
## Independent: 1289
## Republican :1786731
##
##
##
Many different interesting points about the data can be seen in the above table. It seems that Hillary Clinton, under the ‘candidate’ column, had the highest number of occurrences, followed by Bernie Sanders and Donald Trump. Did she also lead with the total amount of contributions and not only the number of contributions?
Other things we can see in this first glance at the dataset, looking at the number of contributions:
New York is the leading city with 204,204 contributions;
California is the leading state with the highest number of contributions (1,294,446);
Retired people take first place under ‘occupation’ and second place under ‘employer’ by number of contributions;
The Democratic party had about 3 times as many contributions as the Republican party (5,556,219 / 1,786,731 ≈ 3.1);
The amounts donated started from a few cents and reached $4,904,861, which was recorded for a single contributor. I wonder who that was.
I will focus on only a few of these questions and variables in this project and drill down where a better understanding of the distributions and the connections between the variables is needed.
Let’s see first how much money was contributed in these elections by all contributors.
## [1] 932698768
The sum of all contributions to all candidates in the 2016 elections was $932,698,768, as recorded in this dataset, which was downloaded from the fec.gov website.
The first (left) plot seems to be a non-descriptive one, but in fact we can learn a few basic things about the ‘amount’ distribution from it. First, following the amounts on the x axis, we can see the huge gap between the highest and lowest contributions. Second, we can see that most of the contributions were not very far from 0 and certainly not in the millions. Looking at the second plot above, after dropping the top 3% of the contributions, we can see that most of the contributions were indeed below $200. The median contribution in this distribution is $28. I will split the donors into big and small at the $200 mark and check which candidates were supported by big and small contributors.
With fewer outliers and less variability, it is easier to look at the data and its distribution, which now resembles a normal distribution.
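Dropping the top 3% before plotting can be sketched as follows. This is a minimal illustration on simulated amounts, not the code behind the plots above; the vector `amount` and the log-normal parameters are made up, and in the real analysis `finance$amount` would take its place.

```r
set.seed(1)
# Simulated right-skewed amounts plus one extreme value, standing in for finance$amount
amount <- c(round(rlnorm(1000, meanlog = 3.3, sdlog = 1.2)), 4904861)

# Cut-off below which 97% of the contributions fall
cutoff <- quantile(amount, 0.97)
trimmed <- amount[amount <= cutoff]

# Side by side: the full distribution and the one without the top 3%
par(mfrow = c(1, 2))
hist(amount, main = "All amounts", xlab = "amount ($)")
hist(trimmed, breaks = 50, main = "Top 3% dropped", xlab = "amount ($)")
```

Using a quantile rather than a fixed dollar threshold keeps the cut-off tied to the data, so the same code works on a sample file as well as on the full dataset.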
We had 18 Republicans, 5 Democrats, 1 Independent and 1 Green, out of 25 candidates in the 2016 elections. There were more than 3 times as many Republican candidates as Democratic ones, and 18 times as many as Green or Independent ones.
There are many questions that this party map of candidates brings up. First, why did the Republicans have so many more candidates than the other big party? Does it mean that the Republicans are more open to bringing different perspectives into their ranks, or that the Democrats are more selective about who runs?
Another obvious point is that there were mainly two parties competing in these elections, and the small ones seemed to have a very slim chance of winning. This is not just because of their minimal representation by candidates; it is also because of the relatively small amounts that were collected by those parties compared to the two big ones, which will be demonstrated later on.
The American political system has been based on a two-party system since its inception, from the Federalist and Democratic-Republican Parties until today’s Democratic and Republican parties. An interesting question for further investigation is: what are the chances of a third party counting in the American political system, and can we learn this from the available data?
##
## 25 50 100 10 5 15 27 250 35
## 1062248 886998 781859 644737 440134 330595 315510 274761 150672
## 20
## 143209
The most frequent contributions were $25, $50, $100, $10, $5, $15, $27, $250, $35 and $20. Interestingly enough, 9 out of the 10 amounts are multiples of 5. The 7th amount in line is $27. This is the number that Bernie Sanders’ campaign advertised as their most popular contribution.
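A frequency table like the one above can be built with `table()` and `sort()`. The sketch below uses a small made-up `amount` vector in place of `finance$amount` and also checks which of the distinct amounts are multiples of 5.

```r
# Toy amounts standing in for finance$amount
amount <- c(25, 25, 25, 50, 50, 100, 27, 27, 10, 5)

# Count each distinct amount and sort from most to least frequent
top_amounts <- sort(table(amount), decreasing = TRUE)
top_amounts

# Which of the distinct amounts are multiples of 5?
values <- as.numeric(names(top_amounts))
multiples_of_5 <- values[values %% 5 == 0]
multiples_of_5
```

On the full dataset, `head(top_amounts, 10)` would give the ten most frequent amounts.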
Looking at the different histograms, Hillary Clinton seems to lead in the number of contributions, followed by Sanders and Trump. It is not really clear from this plot who comes next in descending order. It could be Rubio, Cruz, Bush or Carson. I will dive into who really received the highest number of contributions and who received the highest amount of contributions.
The bar plot says it all. Clinton led these elections in the number of contributions, followed by Sanders, Trump, Cruz, Carson, Rubio, Paul, Fiorina, Bush and Kasich, in this order. So, how many contributions exactly did each of the top 10 candidates receive?
## # A tibble: 25 x 3
## candidate contributions percent
## <chr> <int> <dbl>
## 1 Clinton 3506373 47.7
## 2 Sanders 2042624 27.8
## 3 Trump 746616 10.2
## 4 Cruz 543405 7.39
## 5 Carson 246313 3.35
## 6 Rubio 98957 1.35
## 7 Paul 31564 0.43
## 8 Fiorina 27615 0.38
## 9 Bush 27487 0.37
## 10 Kasich 25238 0.34
## 11 Johnson 13341 0.18
## 12 Stein 9033 0.12
## 13 Walker 6552 0.09
## 14 Huckabee 6396 0.09
## 15 Christie 5786 0.08
## 16 O'Malley 5088 0.07
## 17 Graham 3725 0.05
## 18 Santorum 1677 0.02
## 19 Lessig 1344 0.02
## 20 McMullin 1289 0.02
## 21 Perry 896 0.01
## 22 Webb 790 0.01
## 23 Jindal 764 0.01
## 24 Pataki 323 0
## 25 Gilmore 76 0
Hillary Clinton received 48% of the total number of contributions, followed by Bernie Sanders with 28% and then Donald Trump with only 10%, counting both the primaries and the general election. Clinton received about 4.7 times as many contributions as Trump, yet it did not help her win the race.
## # A tibble: 25 x 4
## candidate contributions sum percent
## <chr> <int> <dbl> <dbl>
## 1 Clinton 3506373 482645402. 51.8
## 2 Trump 746616 121743709. 13.0
## 3 Sanders 2042624 93986274. 10.1
## 4 Cruz 543405 69484768. 7.45
## 5 Rubio 98957 39876120. 4.28
## 6 Bush 27487 32972723. 3.54
## 7 Carson 246313 28869741. 3.1
## 8 Kasich 25238 14685195. 1.57
## 9 Christie 5786 8033999. 0.86
## 10 Fiorina 27615 6714037. 0.72
## # ... with 15 more rows
As we can see, only 8 candidates out of the 25 received more than 1% of the sum of all contributions. Hillary Clinton received 52% of the money, followed by Donald Trump with 13% and Bernie Sanders with 10%.
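The two per-candidate tables above (share of the number of contributions and share of the dollars) can be reproduced with a pattern like the following. It is a base-R sketch on a toy data frame; `finance_toy` and its values are invented for illustration and only the column names `candidate` and `amount` follow the dataset.

```r
# Toy data standing in for the 'finance' data frame
finance_toy <- data.frame(
  candidate = c("Clinton", "Clinton", "Sanders", "Sanders", "Sanders", "Trump"),
  amount    = c(500, 100, 27, 27, 46, 300),
  stringsAsFactors = FALSE
)

# Number of contributions and total dollars per candidate
n_contrib <- tapply(finance_toy$amount, finance_toy$candidate, length)
total     <- tapply(finance_toy$amount, finance_toy$candidate, sum)

by_cand <- data.frame(
  candidate     = names(total),
  contributions = as.integer(n_contrib),
  sum           = as.numeric(total)
)

# Share of all contributions (by count) and of all dollars (by sum), in percent
by_cand$pct_count <- round(100 * by_cand$contributions / sum(by_cand$contributions), 1)
by_cand$pct_sum   <- round(100 * by_cand$sum / sum(by_cand$sum), 1)

# Rank by dollars received
by_cand[order(-by_cand$sum), ]
```

Keeping both percentages side by side makes it easy to spot candidates, like Sanders in this report, whose share of contributions is much larger than their share of dollars.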
## # A tibble: 1,307,046 x 4
## contributor count average sum
## <chr> <int> <dbl> <dbl>
## 1 HILLARY VICTORY FUND - UNITEMIZED 14 3090797. 43271164
## 2 SMITH, MICHAEL 544 177. 96286.
## 3 MILLER, MICHAEL 520 171. 88931.
## 4 BOCH, ERNIE 1 86937. 86937.
## 5 SMITH, JAMES 454 174. 79205.
## 6 SMITH, WILLIAM 601 123. 73901.
## 7 SMITH, DAVID 686 102. 69864.
## 8 WILLIAMS, DAVID 422 165. 69525.
## 9 BROWN, MICHAEL 362 188. 67997.
## 10 SMITH, ROBERT 543 121. 65825.
## # ... with 1,307,036 more rows
In the 2016 elections rich donors could contribute as much as $360,000 through Hillary Clinton’s campaign. This is how it worked: donors who were rich, and willing, could give $5,400 to the Clinton campaign, $33,400 to the Democratic National Committee and $10,000 to each of the participating state parties (32 Democratic state committees), about $350,000 in all. A joint fundraising committee let the donor do it all with a single check.
On Jan. 1, the contribution limits reset for the party committees, and the Hillary Victory Fund could go back to its donors for another $350,000 in party funds.
While the maximum donation to a presidential campaign was $2,700 for the primary elections (plus another $2,700 for the general), the Hillary Victory Fund could accept much larger contributions because it was a so-called joint fundraising committee comprised of multiple committees.
So, the Hillary Victory Fund was not a real individual contributor but an aggregate, and an extreme outlier, in our data. The lack of information about the real contributors must have some influence on one or more of the analyses in this project. The HVF funneled big amounts of money to Hillary Clinton’s campaign, using the state committees as a legal stamp to send money back and forth to reach the maximum amount per donor, leaving only 1% of the contributions to the state committees. As a result, we do not know from the data we have, which is the government’s official 2016 contributions database, who Clinton’s biggest donors were and how much they gave. Democratic donors, knowing the funds would end up with Clinton’s campaign, wrote six-figure checks to influence the election - 100 times larger than allowed. (from investor.com)
The actual big contributors, that were masked by the HVF, like Google, Facebook, JPMorgan Chase & Co, Stanford University, US Dept of State and others, can be found here.
35,209 people contributed to more than one candidate, out of 1,307,046 recorded unique contributors, which is 2.7%. We can see that as the number of candidates goes up, the number of donors goes down, which seems logical. Who were the donors who contributed to the largest number of candidates?
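Counting how many distinct candidates each contributor gave to can be sketched like this. The names and rows in `finance_toy` are invented; the column names `contributor` and `candidate` follow the dataset.

```r
# Toy data: one contributor gives to two candidates, the others to one each
finance_toy <- data.frame(
  contributor = c("DOE, JANE", "DOE, JANE", "DOE, JANE",
                  "ROE, RICHARD", "ROE, RICHARD", "POE, ALEX"),
  candidate   = c("Cruz", "Rubio", "Cruz", "Clinton", "Clinton", "Trump"),
  stringsAsFactors = FALSE
)

# Number of distinct candidates per contributor
n_cand <- tapply(finance_toy$candidate, finance_toy$contributor,
                 function(x) length(unique(x)))

# Contributors who gave to more than one candidate, and their share in percent
multi <- n_cand[n_cand > 1]
share <- 100 * length(multi) / length(n_cand)

sort(multi, decreasing = TRUE)
share
```

Note that grouping by name alone merges different people who share a name; grouping by name and city, as in the table that follows, reduces that risk but does not remove it.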
## # A tibble: 6 x 4
## # Groups: contributor [6]
## contributor city candidates sum
## <chr> <chr> <int> <dbl>
## 1 WILSON, KIRK DALLAS 9 11730.
## 2 CALABRESI, STEVEN PROVIDENCE 8 24300
## 3 DRUMMOND, SARA MONTALBA 8 6700
## 4 AGRON, DOMINICK DINGMANS FERRY 7 4154.
## 5 FRIESS, FOSTER MR. JACKSON 7 18900
## 6 BRYANT, GORDON BEAUFORT 6 2025
Kirk Wilson, from Dallas, Texas (there were a couple of Kirk Wilsons in this database), donated to the largest number of candidates: 9. Let’s see some more information about him and his contributions with a plot.
In 2015, Kirk Wilson contributed first to Fiorina and Huckabee and ended with Bush and Christie, giving to Bush 3 times. He then halted his contributions until the end of November, when he gave to Trump twice. I wonder why, as an obvious Republican supporter, he didn’t give to Trump throughout 2016.
I will now look into Hillary Clinton’s well-known claim that her campaign relied on small donations (less than $100). I doubled that number and split the data at the $200 mark (as other sources suggested) as the point that separates big and small donors.
##
## above $200 below $200
## 326670 136347
As we can see above, Clinton had about 2.4 times as many contributions above $200 as below it, contrary to her claim. I wonder what the ratio is for Trump and Sanders, who were her two main opponents in the two elections.
##
## above $200 below $200
## 110867 382719
Trump had almost 3.5 times more small donors than big donors!
##
## above $200 below $200
## 118399 103102
Sanders had almost the same number of small and big contributors: about 1.1 times as many big donors as small ones.
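The big/small split can be sketched as below. One detail is inferred rather than stated: in the repeat-contributor table further down, a donor whose gifts average $1 but sum to $1,520 is labelled ‘above $200’, which suggests the split was applied to each contributor’s total per candidate. The sketch follows that reading; the toy data, and the choice to count exactly $200 as ‘above’, are assumptions.

```r
# Toy data standing in for 'finance'
finance_toy <- data.frame(
  contributor = c("A", "A", "A", "B", "C", "C"),
  candidate   = c("Clinton", "Clinton", "Clinton", "Clinton", "Trump", "Trump"),
  amount      = c(100, 100, 50, 2700, 20, 30),
  stringsAsFactors = FALSE
)

# Total given by each contributor to each candidate
totals <- aggregate(amount ~ contributor + candidate, data = finance_toy, FUN = sum)

# Label each contributor/candidate total as big or small at the $200 mark
totals$split_200 <- ifelse(totals$amount >= 200, "above $200", "below $200")

# Big vs. small donors per candidate
tab <- table(totals$candidate, totals$split_200)
tab
```

If the split were instead applied to each single contribution, the `aggregate()` step would simply be skipped and the counts would shift toward ‘below $200’.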
Let’s see the distribution of contributions above and below $200 for all candidates in a graph.
It seems that every candidate except Donald Trump received more money from ‘big donors’ than from small ones in the 2016 elections. Trump far surpassed the rest of the candidates in small-donor contributions. Hillary, on the other hand, was the biggest recipient of big donations, while Sanders, Cruz and Carson received a more balanced ratio of contributions from small and big donors.
Working on the above data, I noticed that some people contributed more than once. Let’s see who they were.
## # A tibble: 6 x 6
## # Groups: contributor [6]
## contributor candidate count average sum split_200
## <chr> <chr> <int> <dbl> <dbl> <chr>
## 1 TRUITT, ROBERTA Clinton 1520 1 1520 above $200
## 2 BODNICK, KATIE Clinton 1313 4 5465. above $200
## 3 AMISIAL, WILFRID Clinton 1078 3 3526. above $200
## 4 PURCELL, LARRY Sanders 705 4 3138. above $200
## 5 SAUNDERS, ELIZABETH Clinton 675 6 4324. above $200
## 6 SCHWARTZ, HILARY Clinton 622 7 4429. above $200
Wow! Some people contributed hundreds of times. Roberta Truitt, who leads this list, donated 1,520 times with an average of $1, all to the Clinton campaign. There can be many reasons for that. It could be an automated system that makes the online contributions for a person, or an army of trolls who pump up the number of contributions for their candidate. An interesting question for me here is which candidate had the highest number of repeating contributors. I will define extreme repeating contributors as those who donated more than 100 times.
## # A tibble: 8 x 3
## candidate sum_count average
## <chr> <int> <dbl>
## 1 Clinton 200998 149.
## 2 Sanders 77718 137.
## 3 Cruz 8676 142.
## 4 Trump 395 132.
## 5 Johnson 243 122.
## 6 Rubio 217 108.
## 7 Fiorina 107 107
## 8 Carson 104 104
Hillary Clinton was ahead of everyone else with more than 200K ‘extreme contributions’, followed by Sanders with about 78K. The number at the top of the bars is the average number of contributions per extreme donor.
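The ‘extreme contributor’ summary can be sketched as follows: count contributions per contributor and candidate, keep the pairs with more than 100, then total and average them per candidate. The toy data frame is made up, and the column names `sum_count` and `average` simply mirror the table above.

```r
# Toy data: X and Y give to Clinton 150 and 120 times, Z gives to Trump 3 times
finance_toy <- data.frame(
  contributor = rep(c("X", "Y", "Z"), times = c(150, 120, 3)),
  candidate   = rep(c("Clinton", "Clinton", "Trump"), times = c(150, 120, 3)),
  stringsAsFactors = FALSE
)

# Contributions per contributor/candidate pair
counts <- aggregate(n ~ contributor + candidate,
                    data = transform(finance_toy, n = 1), FUN = sum)

# Keep 'extreme' repeat donors: more than 100 contributions
extreme <- counts[counts$n > 100, ]

# Per candidate: total contributions from extreme donors and the average per donor
sum_count <- tapply(extreme$n, extreme$candidate, sum)
average   <- tapply(extreme$n, extreme$candidate, mean)

data.frame(candidate = names(sum_count),
           sum_count = as.integer(sum_count),
           average   = as.numeric(average))
```

The 100-contribution threshold is the one chosen in the text; changing it only requires editing the filter line.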
## # A tibble: 13 x 4
## occupation number sum percent
## <chr> <int> <dbl> <dbl>
## 1 RETIRED 1642509 163191821. 22.3
## 2 NOT EMPLOYED 626063 31419941. 8.5
## 3 INFORMATION REQUESTED 239700 37880243. 3.3
## 4 ATTORNEY 199767 51658327. 2.7
## 5 TEACHER 141592 7936464. 1.9
## 6 PHYSICIAN 111942 19291577. 1.5
## 7 HOMEMAKER 108421 30022237. 1.5
## 8 PROFESSOR 102188 10125557. 1.4
## 9 CONSULTANT 86321 16469980. 1.2
## 10 ENGINEER 76261 8311212. 1
## 11 SALES 62750 5817429. 0.9
## 12 LAWYER 56398 14809384. 0.8
## 13 MANAGER 54675 6864216. 0.7
The chart above cannot tell us much, since there are about 120,000 distinct occupations that donors entered on their contribution forms. The field accepted any text without restriction, so many occupations were written many times in different variations.
In order to analyze this facet of the dataset, we would have to write an algorithm that searches for similar terms and combines them.
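One possible starting point for such an algorithm, sketched here under stated assumptions and not part of the original analysis: first normalise case, punctuation and whitespace, then snap near-misses onto a short list of canonical terms using edit distance (`adist()` from the utils package). The `canonical` list and the distance threshold of 1 are hypothetical choices.

```r
# Toy free-text occupations with typical variations
occupation <- c("ATTORNEY", "Attorney ", "ATTORNEY.", "ATTORNY",
                "LAWYER", "RETIRED", "retired")

# Step 1: normalise case, punctuation and surrounding whitespace
clean <- toupper(occupation)
clean <- gsub("[[:punct:]]", "", clean)
clean <- trimws(clean)

# Step 2: snap near-misses (edit distance <= 1) onto canonical terms
canonical <- c("ATTORNEY", "LAWYER", "RETIRED")   # hypothetical list
d <- utils::adist(clean, canonical)               # generalized Levenshtein distance
nearest <- canonical[apply(d, 1, which.min)]
grouped <- ifelse(apply(d, 1, min) <= 1, nearest, clean)

table(grouped)
```

Edit distance only catches spelling variants. Merging true synonyms such as ATTORNEY and LAWYER would need an explicit synonym table on top of this.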
Nevertheless, in the above chart the share of retired donors (22.3% of contributions) is pretty impressive, compared to retirees being 14.5% of the population in 2016.
Also interesting here is the high share of donors who reported being ‘not employed’ at that time. I would think unemployed people would not have the money to donate, but they did, with more than 600,000 contributions.
Women had a slight lead with the number of contributions.
## # A tibble: 2 x 2
## gender contributions
## <chr> <int>
## 1 female 3703574
## 2 male 3649698
Women contributed 3,703,574 times and men contributed 3,649,698 times. It is interesting to note that women not only contributed more, they also voted more than men in those elections. According to the Center for American Women and Politics, women have voted more than men in every election since 1964.
Source: Center for American Women and Politics
Why did women vote or contribute more than men? Maybe it is related to the fact that the US population in 2016 was 51% women and 49% men? That is a very interesting question for further research about women’s involvement in political issues, which, unfortunately, is out of the scope of this project.
The red and blue lines are the Republican and Democratic primaries, respectively, and the green line is the general election.
People started to donate already in 2014, but in very small numbers, as can be seen further down. Most donors contributed between early 2015 and November 2016. Some kept on giving even after the election, but that died out by January 2017. We can see a steady build-up of the amount donated, leading to the highest amounts in the months and days before the general election. There was a peak of contributions between February and June of 2016 and a drop right after. This might be related to the Republican and Democratic primaries, which took place between February and mid-June 2016.
Here is an interactive graph that I used to look for interesting patterns, followed by some insights.
Zoom in by selecting a range of dates with your mouse. To zoom out, double click on the graph.
Americans donate very little, if at all, on Saturdays and Sundays, as can be seen when zooming to a week level on the above graph. This pattern is consistent throughout the election cycle.
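The weekend pattern can also be checked numerically instead of by zooming. The sketch below uses a made-up `daily` data frame with one total per day; with the real data, the daily totals would first be aggregated from the ‘date’ and ‘amount’ columns.

```r
# Two full weeks of invented daily totals, starting on Monday 2016-03-07
daily <- data.frame(
  date   = as.Date("2016-03-07") + 0:13,
  amount = c(10, 12, 11, 13, 15, 2, 1, 9, 14, 12, 11, 16, 1, 2)
)

# ISO weekday number: 1 = Monday ... 7 = Sunday (locale-independent, unlike weekdays())
daily$wday <- as.integer(format(daily$date, "%u"))
daily$weekend <- daily$wday >= 6

# Total donated on weekdays (FALSE) vs. weekends (TRUE)
by_weekend <- tapply(daily$amount, daily$weekend, sum)
by_weekend
```

Comparing the mean per day rather than the sum would correct for there being five weekdays and only two weekend days.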
Zooming in on the farthest left side of the graph, we can see that there were contributions given as early as late 2013 (and not 2014 as assumed above). It seems that the bigger numbers of donations started kicking in somewhere in mid-to-late March 2015. Interestingly, less than a month later was the day that Hillary Clinton officially announced her run. Also, it seems that after Trump’s announcement, on June 16th 2015, there was an increase in donations as well.
Another very interesting pattern is revealed when zooming in to a month level. The sum of donations on the last day(s) of each month increased sharply compared to the rest of the month, which can be seen as the spikes along the x axis.
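The month-end spikes can be flagged in code as well. A date is the last day of its month when the following day falls in a different month; the `dates` and `amount` vectors below are invented for illustration.

```r
# Invented daily totals around two month ends
dates  <- as.Date(c("2016-01-30", "2016-01-31", "2016-02-28", "2016-02-29", "2016-03-15"))
amount <- c(10, 90, 12, 80, 11)

# TRUE when the next calendar day belongs to another month
is_month_end <- format(dates, "%m") != format(dates + 1, "%m")

# Average daily total on month-end days vs. all other days
month_end_means <- tapply(amount, is_month_end, mean)
month_end_means
```

This handles months of different lengths, including February in a leap year such as 2016, without a lookup table.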
Let’s take a look at the early donations and who received them.
Now, looking at the distribution of the donations, the pattern looks clearer. Donations were mostly given prior to an election. The assumption that the contributions peak we saw in the previous plot, between February and June 2016, is related to the primaries was correct.
## # A tibble: 8 x 3
## # Groups: gender [2]
## gender party num_contrib
## <chr> <chr> <int>
## 1 female Democrat 3036176
## 2 male Democrat 2520043
## 3 male Republican 1122975
## 4 female Republican 663756
## 5 male Green 6028
## 6 female Green 3005
## 7 male Independent 652
## 8 female Independent 637
Looking at the above faceted data, the trend we saw earlier, with contributions growing over time and closer to the general election, is missing from the Republican party. There actually seems to be a downward trend towards the general election on the Republican side.
Women contributed 1.2 times more than men to the Democratic party. On the other side of the aisle, Republican men contributed 1.7 times more than women. The Green party had an even wider gap between men’s and women’s number of contributions: men contributed twice as much as women to that party. The Independent party was the only one with an almost identical number of contributions from men and women. We can also see here that Democrats received the highest number of contributions. Did they also receive the highest amount of contributions?
## # A tibble: 4 x 2
## party sum_contrib
## <chr> <dbl>
## 1 Democrat 581598736.
## 2 Republican 349620307.
## 3 Green 1132327.
## 4 Independent 347398.
Democrats received $582M, about 1.7 times as much as the Republicans.
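The ratios quoted for gender and party can be re-derived directly from the two tables above. The sketch below copies the printed counts and sums into R by hand; only the object names (`counts`, `sums`) are made up.

```r
# Number of contributions by party and gender, copied from the table above
counts <- matrix(
  c(3036176, 2520043,
     663756, 1122975,
       3005,    6028,
        637,     652),
  ncol = 2, byrow = TRUE,
  dimnames = list(party  = c("Democrat", "Republican", "Green", "Independent"),
                  gender = c("female", "male"))
)

# Women's contributions relative to men's, and the reverse, per party
female_to_male <- round(counts[, "female"] / counts[, "male"], 2)
male_to_female <- round(counts[, "male"] / counts[, "female"], 2)
female_to_male
male_to_female

# Dollars received by the two big parties, copied from the party table above
sums <- c(Democrat = 581598736, Republican = 349620307)
dem_to_rep <- round(sums[["Democrat"]] / sums[["Republican"]], 2)
dem_to_rep
```

Computing the ratios from the tables, rather than reading them off a plot, avoids small rounding slips in the text.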
Women donated to Clinton 1.5 times more than men, and men donated to Trump 1.7 times more than women. It seems that gender played a pretty dominant role in contributions to those two candidates.
3 out of the 4 candidates who received early donations were Republicans: Cruz, Paul and Rubio. Rubio was the only one who received contributions in 2013 and most of 2014. Did starting early help Rubio? Let’s see how much money each candidate collected along the way to the elections.
(active chart)
Looking at the map above and focusing on my area (Silicon Valley, CA), I can clearly see how the richer cities contributed more money, regardless of their size. Leading the state contributions are California with $160M, New York with $130M, Texas with $85M and Florida with $62M. As for cities, Palm Beach pops up first with the size of the red dot and the dark color. Did the amount contributed from each state reflect the size of its population? Let’s take a look at it with 2 charts. In a further investigation I would add a few variables, like gender and party, and try to find hints of relationships between all the variables.
We can see that there is a very strong correlation (0.935) between the number of contributions per state and the number of citizens in that state. In a further investigation I would analyze the correlation between cities and their financial contributions to the different parties.
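A correlation like the one above is a single call to `cor()` once contributions and population are in one table per state. The sketch below uses invented numbers for five hypothetical states, so its result is not the 0.935 reported here; the data frame `states` and its columns are assumptions.

```r
# Invented per-state figures, for illustration only
states <- data.frame(
  state         = c("s1", "s2", "s3", "s4", "s5"),
  population    = c(1, 2, 4, 8, 10),          # millions
  contributions = c(30, 55, 130, 240, 310)    # thousands
)

# Pearson correlation (the default method of cor())
r <- cor(states$population, states$contributions)
r

# Scatter plot with a fitted line to eyeball the relationship
plot(states$population, states$contributions,
     xlab = "population (millions)", ylab = "contributions (thousands)")
abline(lm(contributions ~ population, data = states))
```

With real data, the per-state contribution counts would come from the ‘state’ column and the population from an external source, joined by state name.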
Exploring the 2016 elections’ contributions taught me a lot of things I was not aware of, despite them being publicly available and me being an avid follower of politics. I started the project with a dataset of financial contributions from California, the state I live in, but found it lacking data that is available on the national level, knowing that I could always go back and drill down into the state’s data. It seemed like more of a challenge to work with the national dataset, and it indeed was exactly that. Choosing to work with more than 7 million rows on a laptop was at the beginning very difficult and especially time consuming, but with time I found better ways and tools for a given task. For example, I experienced issues with dplyr and knitr, so I moved to sqldf, which was much slower, and ended up working with built-in R functions, as in the “Percent of contributions per gender” block above. By the time I ended this project I made most of the calls in the code with dplyr. In order to improve the workflow, I also created a sample file, which I used to run the more time-consuming code blocks.
As far as exploring one, two and multiple variables, I found that it is sometimes necessary to add a two-variable plot or explanation right after exploring one variable, for the sake of continuity and readability. An example is the #1 multi-candidate contributor plot, which drilled down on the list of contributors who donated to more than one candidate.
Naturally, the challenging part, and the part that took the longest, was the data wrangling, which I saved in a separate file called all-munge.R. This file outputs a clean dataset with all the 2016 elections’ contributions, which saved me tons of time, since I did not have to rerun the entire script each time I closed the program (RStudio) or it crashed.
Data can be misleading if it is not connected to the real-life events that produced the topic being investigated. For example, the Clinton campaign had many contributions from many contributors coming in, represented by only one name (Hillary Victory Fund). The ‘contributor’ HVF was an outlier that skewed the data. On the other hand, removing this ‘contributor’ from the list means removing the sums of the donations that the HVF encompasses. For example,
External datasets. In order to complete missing information in the dataset, I used data from different sources. For the US population and state information, like zip codes, longitude and latitude, I used census.gov. The cities data was taken from simplemaps.com.
Removing and adding new variables. I found through this project that, on one hand, you want to minimize the number of columns for the sake of speed, and on the other hand, you find that those same variables can be meaningful farther down the analysis. I had to go back and rework the program, adding the old variables back to the dataset.
If I need to do this project again I will use the Wine dataset. That dataset has 12 numeric vectors that are straightforward for correlation analysis, compared to 1 in the 2016 elections dataset I explored here. Nevertheless, I very much enjoyed looking at the 2016 elections data and came up with some interesting points that I was not aware of.